
2 research outputs found

The Gender-GAP Pipeline: A Gender-Aware Polyglot Pipeline for Gender Characterisation in 55 Languages

Author: Alastruey Belen
Andrews Pierre
Costa-jussà Marta R.
Hansanti Prangthip
Kalbassi Elahe
Muller Benjamin
Ropers Christophe
Smith Eric Michael
Williams Adina
Zettlemoyer Luke
Publication date: 31/08/2023

Gender biases in language generation systems are challenging to mitigate. One possible source for these biases is gender representation disparities in the training and evaluation data. Despite recent progress in documenting this problem and many attempts at mitigating it, we still lack shared methodology and tooling to report gender representation in large datasets. Such quantitative reporting will enable further mitigation, e.g., via data augmentation. This paper describes the Gender-GAP Pipeline (for Gender-Aware Polyglot Pipeline), an automatic pipeline to characterize gender representation in large-scale datasets for 55 languages. The pipeline uses a multilingual lexicon of gendered person-nouns to quantify the gender representation in text. We showcase it to report gender representation in WMT training data and development data for the News task, confirming that current data is skewed towards masculine representation. Having unbalanced datasets may indirectly optimize our systems towards outperforming one gender over the others. We suggest introducing our gender quantification pipeline in current datasets and, ideally, modifying them toward a balanced representation. Comment: 15 page
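The lexicon-based counting mentioned in the abstract can be illustrated with a minimal sketch. This is not the Gender-GAP implementation: the tiny English-only lexicon, the three class labels, and the function names `count_gender_nouns` and `gender_shares` are all assumptions made for illustration, and the real pipeline may tokenize and match quite differently.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (English only). The actual Gender-GAP lexicon
# covers 55 languages and is not reproduced here.
TOY_LEXICON = {
    "masculine": {"man", "men", "boy", "father", "son", "brother", "husband"},
    "feminine": {"woman", "women", "girl", "mother", "daughter", "sister", "wife"},
    "unspecified": {"person", "people", "child", "parent", "sibling", "spouse"},
}


def count_gender_nouns(text, lexicon=TOY_LEXICON):
    """Count lexicon matches per gender class in a piece of text."""
    # Naive tokenization: lowercase and keep runs of ASCII letters only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter({label: 0 for label in lexicon})
    for token in tokens:
        for label, nouns in lexicon.items():
            if token in nouns:
                counts[label] += 1
    return counts


def gender_shares(counts):
    """Turn raw counts into the share of matches belonging to each class."""
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in counts}
    return {label: n / total for label, n in counts.items()}


sample = "The man met his brother and a woman."
counts = count_gender_nouns(sample)
shares = gender_shares(counts)
print(dict(counts))
print(shares)
```

Reporting shares rather than raw counts makes datasets of different sizes comparable, which is the kind of quantitative reporting the abstract argues for.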

arXiv.org e-Print Archive

SeamlessM4T-Massively Multilingual & Multimodal Machine Translation

Author: Akula Bapi
Andrews Pierre
Balioglu Can
Barrault Loïc
Celebi Onur
Chen Peng-Jen
Chung Yu-An
Communication Seamless
Costa-jussà Marta R.
Dale David
Dong Ning
Duquenne Paul-Ambroise
Elbayad Maha
Ellis Brian
Elsahar Hady
Gao Cynthia
Gong Hongyu
Gonzalez Gabriel Mejia
Guzmán Francisco
Haaheim Justin
Hachem Naji El
Hansanti Prangthip
Heffernan Kevin
Hoffman John
Howes Russ
Huang Bernie
Hwang Min-Jae
Inaguma Hirofumi
Jain Somya
Kalbassi Elahe
Kallet Amanda
Kao Justine
Klaiber Christopher
Kulikov Ilia
Lam Janice
Lee Ann
Li Daniel
Li Pengwei
Licht Daniel
Ma Xutai
Maillard Jean
Mavlyutov Ruslan
Meglioli Mariano Cora
Mourachko Alexandre
Peloquin Benjamin
Pino Juan
Popuri Sravya
Rakotoarison Alice
Ramadan Mohamed
Ramakrishnan Abinesh
Ropers Christophe
Sadagopan Kaushik Ram
Saleem Safiyyah
Schwenk Holger
Sun Anna
Tomasello Paden
Tran Kevin
Tran Tuan
Tufanov Igor
Vogeti Vish
Wang Changhan
Wang Jeff
Wang Skyler
Wenzek Guillaume
Wood Carleigh
Yang Yilin
Ye Ethan
Yu Bokai
Publication date: 23/08/2023

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages? While recent breakthroughs in text-based models have pushed machine translation coverage beyond 200 languages, unified speech-to-speech translation models have yet to achieve similar strides. More specifically, conventional speech-to-speech translation systems rely on cascaded systems that perform translation progressively, putting high-performing unified systems out of reach. To address these gaps, we introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-speech translation, text-to-text translation, and automatic speech recognition for up to 100 languages. To build this, we used 1 million hours of open speech audio data to learn self-supervised speech representations with w2v-BERT 2.0. Subsequently, we created a multimodal corpus of automatically aligned speech translations. Filtered and combined with human-labeled and pseudo-labeled data, we developed the first multilingual system capable of translating from and into English for both speech and text. On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation. Compared to strong cascaded models, SeamlessM4T improves the quality of into-English translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in speech-to-speech. Tested for robustness, our system performs better against background noises and speaker variations in speech-to-text tasks compared to the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and added toxicity to assess translation safety. Finally, all contributions in this work are open-sourced and accessible at https://github.com/facebookresearch/seamless_communicatio

arXiv.org e-Print Archive